Indexing in EMu

EMu has a number of indexing methods for efficient and timely access to data. An indexing method is an algorithm or set of rules to search data in an indirect way. The simplest type of indexing, known as an exhaustive search, is no index at all. In this case, each record is read sequentially and compared against the search terms entered. If there is a match, the record is added to the set of matching records, then the next record is read. The exhaustive search method is very space efficient as only the data needs to be stored. However, a search may take some time to complete if there is a very large number of records (perhaps several hours for 750,000 records).
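The exhaustive search described above can be sketched in a few lines. This is a minimal illustration only, not EMu's implementation: the function name, the record layout and the sample data are all assumptions made for the example.

```python
def exhaustive_search(records, term):
    """Return the records that contain `term`, reading every record in turn."""
    matches = []
    term = term.lower()
    for record in records:  # no index: each record is read sequentially
        # Compare the words in every field of the record against the search term.
        if any(term in str(value).lower().split() for value in record.values()):
            matches.append(record)
    return matches


# Hypothetical sample data; the field names are illustrative only.
records = [
    {"irn": 1, "Notes": "Bronze figurine of a horse"},
    {"irn": 2, "Notes": "Ceramic bowl with painted horse motif"},
    {"irn": 3, "Notes": "Iron sword"},
]

print([r["irn"] for r in exhaustive_search(records, "horse")])
```

Nothing is stored beyond the data itself, but the work grows in direct proportion to the number of records, which is why this approach becomes slow for very large tables.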

To facilitate the search of large numbers of records, indexes are built that provide rapid access to data that match given search criteria. Indexes provide an indirect means of searching data in a judicious manner: when a search term is entered, the indexes are consulted to produce the matching records. There is a cost associated with indexing: the need to store indexing information along with the data.

There are a large number of indexing methods available to designers of databases, each one with associated pros and cons. The EMu database engine employs two flexible indexing methods to provide rapid retrieval of data from large numbers of records:

  • The first is known as linear hashing and provides high speed key retrieval.
  • The second goes by the long name of "A two level superimposed coding scheme for partial match retrieval" (shortened to the Two Level method) and provides a general purpose framework for implementing a wide range of term based searches. A term is simply a sequence of characters that forms the basic entity for searching. For example, in word based searching (where you need only enter a word to find matching records), a term is a word.
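The general idea behind superimposed coding can be sketched as follows. This is a simplified, single-level illustration under stated assumptions, not the EMu or Texpress implementation: the signature width, the number of bits per term, the choice of hash function and all names are hypothetical, and the additional level of descriptors that gives the Two Level method its name is omitted.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width, chosen for illustration
BITS_PER_TERM = 3    # assumed number of bits each term sets


def term_mask(term):
    """Hash a term to a bit mask with up to BITS_PER_TERM bits set."""
    mask = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term.lower()}".encode()).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask


def record_signature(text):
    """Superimpose (bitwise OR) the masks of every word in the record."""
    signature = 0
    for word in text.split():
        signature |= term_mask(word)
    return signature


def search(indexed_records, term):
    """Use signatures to skip records, then confirm candidates against the data."""
    mask = term_mask(term)
    hits = []
    for text, signature in indexed_records:
        if (signature & mask) == mask:  # all of the term's bits are set: possible match
            if term.lower() in text.lower().split():  # discard false positives
                hits.append(text)
    return hits


# Hypothetical sample data.
texts = [
    "Bronze figurine of a horse",
    "Ceramic bowl with painted horse motif",
    "Iron sword",
]
indexed_records = [(t, record_signature(t)) for t in texts]

print(search(indexed_records, "horse"))
```

A signature can never miss a record that contains the term, but unrelated words may happen to set the same bits, so each candidate is confirmed against the stored data. The signatures are the extra indexing information that must be stored alongside the records.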

A number of pre-configured indexing options are distributed with EMu. In particular, a number of fields that contain name based data already have phonetic based indexing enabled. Many descriptive fields (e.g. Notes) also have stem based indexing set.

A third indexing method, available with EMu 8.0, allows Apache Solr to be used for searching in place of the default Texpress based indexing. When Solr indexing is enabled, Solr does not replace Texpress as the underlying database engine used by EMu; it replaces only the Texpress searching component. Texpress effectively outsources its searching to Solr, and Solr returns the results to Texpress, which passes them on to EMu. Details here.

It is possible to adjust indexing via Registry entries. These entries allow institutions to tune indexing methods to provide the most efficient searching possible without wasting disk space on unused methods.

Users in group Admin can use an Admin Task to view indexing information: